NOTE: skip this section if you are not running R locally (e.g., if you are running R in your browser using a remote Jupyter server)
You should have R installed –if not:
Download workshop materials:
R is a programming language designed for statistical computing. Notable characteristics include:
OK, it’s free and popular, but what makes R worth learning? In a word, “packages”. If you have a data manipulation, analysis or visualization task, chances are good that there is an R package for that. Let’s install some packages and look at some examples.
## install.packages(c("ggmap", "plotly", "rgl", "forecast"))
library(ggmap)
nwbuilding <- geocode("1737 Cambridge Street Cambridge, MA 02138", source = "google")
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=1737%20Cambridge%20Street%20Cambridge,%20MA%2002138&sensor=false
ggmap(get_map("Cambridge, MA", zoom = 15)) +
  geom_point(data=nwbuilding, size = 7, shape = 13, color = "red")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Cambridge,+MA&zoom=15&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Cambridge,%20MA&sensor=false
library(forecast)
library(plotly)
## from https://esa.un.org/unpd/wpp/Download/Standard/Population/
worldpop <- structure(c(2.525149312, 2.571867515, 2.617940399, 2.66402901,
2.710677773, 2.758314525, 2.807246148, 2.85766291, 2.909651396,
2.963216053, 3.018343828, 3.075073173, 3.133554362, 3.194075347,
3.256988501, 3.322495121, 3.390685523, 3.461343172, 3.533966901,
3.607865513, 3.682487691, 3.757734668, 3.833594894, 3.90972212,
3.985733775, 4.061399228, 4.13654207, 4.211322427, 4.286282447,
4.362189531, 4.439632465, 4.518602042, 4.599003374, 4.681210508,
4.765657562, 4.852540569, 4.942056118, 5.033804944, 5.126632694,
5.218978019, 5.309667699, 5.398328753, 5.485115276, 5.57004538,
5.653315893, 5.735123084, 5.815392305, 5.894155105, 5.971882825,
6.049205203, 6.126622121, 6.204310739, 6.282301767, 6.360764684,
6.439842408, 6.51963585, 6.600220247, 6.68160732, 6.763732879,
6.846479521, 6.92972504300001, 7.013427052, 7.097500453, 7.181715139,
7.265785946, 7.349472099), .Tsp = c(1950, 2015, 1), class = "ts")
## Projected numbers (in billions) of humans living on earth
fit <- auto.arima(worldpop)
ggplotly(autoplot(forecast(fit)))
comet <- rgl::readOBJ(url("http://sci.esa.int/science-e/www/object/doc.cfm?fobjectid=54726"))
plot_ly(x = comet$vb[1,],
        y = comet$vb[2,],
        z = comet$vb[3,],
        i = comet$it[1,]-1,
        j = comet$it[2,]-1,
        k = comet$it[3,]-1,
        type = "mesh3d")
Whatever you’re trying to do, you’re probably not the first to try doing it in R. Chances are good that someone has already written a package for that.
Coming from…
The old-school way is to run R directly in a terminal
But hardly anybody does it that way anymore! The Windows version of R comes with a GUI that looks like this:
The default windows GUI is not very good
RStudio (an alternative GUI for R) is shown below.
RStudio has many useful features, including parentheses matching and auto-completion. RStudio is not the only advanced R interface; other alternatives include Emacs with ESS (shown below).
Emacs + ESS is a very powerful combination, but can be difficult to set up.
Jupyter is a notebook interface that runs in your web browser. A lot of people like it. You can access these workshop notes as a Jupyter notebook at http://tutorials-live.iq.harvard.edu:8000/notebooks/workshops/R/Rintro/Rintro.ipynb
Note: skip this section if you are not using RStudio (e.g., if you are running these examples in a Jupyter notebook).
Open the Rintro.R script in the Rintro folder on your desktop.
Anything following # is a comment that will be ignored by R. My comments all start with ##; you can add your own, possibly using # or ### to distinguish your comments from mine.
Now that we know what we’re getting into and have our environment set up, let’s get to work.
The purpose of this exercise is mostly to give you an opportunity to explore the interface provided by RStudio (or whichever GUI you’ve decided to use). You may not know how to do these things; that’s fine! This is an opportunity to learn. If you don’t know how to do something you can use internet search engines, search on StackOverflow, or ask the person next to you.
Also keep in mind that we are living in a golden age of tab completion. If you don’t know the name of an R function, try guessing the first two or three letters and pressing TAB. If you guessed correctly the function you are looking for should appear in a pop up!
car. Try to install this package.
## 1. 2 plus 2
2 + 2
## [1] 4
## or
sum(2, 2)
## [1] 4
## 2. square root of 10:
sqrt(10)
## [1] 3.162278
## or
10^(1/2)
## [1] 3.162278
## 3. Install the "car" package:
## In RStudio, go to the "Packages" tab and click the "Install" button.
## Search in the pop-up window and click "Install".
## Alternatively, use the `install.packages` function like this:
install.packages("car")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
## 4. Find "An Introduction to R".
## Go to the main help page by running 'help.start()' or using the GUI
## menu, find and click on the link to "An Introduction to R".
## 5. Go to <http://cran.r-project.org/web/views/> and skim the topic
## closest to your field/interests.
## I like the machine learning topic.
I would like to know what the most popular baby names are. In the course of answering this question we will learn to call R functions, install and load packages, assign values to names, read and write data, and more.
The examples in this workshop use the baby names data provided by the governments of the United States and the United Kingdom. A cleaned and merged version of these data is in dataSets/babyNames.csv.
Our first goal is to read these data into R. In order to do that we need to learn how to call functions, install packages, set our working directory, read a .csv file, and assign the result to a name. Let’s get to it.
There are thousands of R packages that extend R’s capabilities. Some packages are distributed with R, and some of these are attached to the search path by default. Many more are available in package repositories.
In order to make reading and analyzing our baby names data easier we will install and use a collection of packages called tidyverse. tidyverse is a meta package that loads the dplyr package for easier data manipulation, the readr package for easier data import/export, and several other useful packages.
Packages can be installed using the install.packages function.
The general form for calling R functions is
## FunctionName(arg.1 = value.1, arg.2 = value.2, ..., arg.n = value.n)
Arguments can be matched by position or name. Let’s see how that works, using the install.packages function.
Since this is the first time we are using the install.packages function we will start by looking up its help page. This is almost always the first thing you should do when using a function for the first time. You can look up the help page for a function like this:
?install.packages
As we can see from the documentation, the first (and only required) argument is named pkgs. Additional arguments specify where this package should be installed from (repos) and to (lib) among other things.
OK, let’s install the “car” package from the repo at “https://cran.rstudio.com”.
install.packages("car", repos = "https://cran.rstudio.com")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
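Matching by position versus by name can be tried out without installing anything. The sketch below defines a made-up function, greet (not part of R or of any package), purely for illustration, and calls it three ways.

```r
## `greet` is a hypothetical function written only for this illustration.
greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}

by.position <- greet("R", "Hi")                   # arguments matched by position
by.name     <- greet(greeting = "Hi", name = "R") # matched by name, in any order
default.arg <- greet("R")                         # `greeting` falls back to its default

by.position
by.name
default.arg
```

The first two calls give the same result. Naming arguments costs a little typing but makes a call easier to read, especially for functions with many arguments such as install.packages.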
Installing a package puts a copy of the package on your local computer, but does not make it available for use. To use an installed package you must attach it using the library function.
library("car")
Now that we’ve installed the car package, how do we use it? We’ve already seen that we can look up the help page using ?. This is actually a shortcut to the help function:
help(help)
The help function can be used to look up the documentation for a function, or to look up the documentation for a package. We can learn how to use the car package by reading its documentation like this:
help(package = "car")
The purpose of this exercise is to practice using the package management and help facilities.
Install the tidyverse package.
Use the library function to attach the tidyverse package.
.csv) file?
## 1. install the tidyverse package
install.packages("tidyverse")
## Installing package into '/home/izahn/R/x86_64-pc-linux-gnu-library/3.3'
## (as 'lib' is unspecified)
## 2. attach the tidyverse package
library("tidyverse")
## 3. look up the readr package documentation
help(package = "readr")
## I would use read_tsv to read a tab delimited file.
Now that we have installed and attached the tidyverse (and readr) packages, and know which function to use to read our data (read_csv), we are almost ready to read in the baby names data. Before we do that, let’s take a small excursion to learn about assignment and basic data types in R.
Values can be assigned names and used in subsequent operations.
The <- operator (less than followed by a dash) is used to save values.
x <- 10 # Assign the value 10 to a variable named x
x + 1 # Add 1 to x
## [1] 11
x # note that x is unchanged
## [1] 10
y <- x + 1 # Assign y the value x + 1
y
## [1] 11
x <- x + 100 # change the value of x
y ## note that y is unchanged.
## [1] 11
The x and y data objects we created are numeric vectors of length one. Vectors are the simplest data structure in R, and are the building blocks used to make more complex data structures. Here are some more vector examples.
x <- c(10, 11, 12)
y <- c("10", "11", "12")
z <- c(TRUE, FALSE, TRUE, TRUE)
Notice that the c function combines its arguments into a vector.
All R objects have a type (aka mode) and length. Since it is impossible for an object not to have these attributes they are called intrinsic attributes. They can be retrieved using the typeof and length functions.
c(x = x, type = typeof(x), length = length(x))
## x1 x2 x3 type length
## "10" "11" "12" "double" "3"
c(y = y, type = typeof(y), length = length(y))
## y1 y2 y3 type length
## "10" "11" "12" "character" "3"
c(z = z, type = typeof(z), length = length(z))
## z1 z2 z3 z4 type length
## "TRUE" "FALSE" "TRUE" "TRUE" "logical" "4"
Data structures in R can be converted from one type to another using one of the many functions beginning with as.. For example:
typeof(x)
## [1] "double"
typeof(as.character(x))
## [1] "character"
typeof(y)
## [1] "character"
typeof(as.numeric(y))
## [1] "double"
These vectors (double, character, logical) are called atomic vectors because each element must be of the same type. Given inputs with conflicting types R will convert them for you.
typeof(c(1, 2))
## [1] "double"
typeof(c(1, "2"))
## [1] "character"
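The conversion follows a fixed order: logical values become integers, integers become doubles, and anything combined with a string becomes character. Here is a small sketch of that order, plus what happens when an explicit conversion is impossible. The example values are made up.

```r
## Implicit coercion picks the most flexible type among the inputs.
logical.and.integer <- typeof(c(TRUE, 1L))   # logical is promoted to integer
integer.and.double  <- typeof(c(1L, 2.5))    # integer is promoted to double
double.and.text     <- typeof(c(2.5, "a"))   # everything becomes character

## An explicit conversion that cannot succeed yields NA. R also issues a
## warning, which is suppressed here to keep the output short.
not.a.number <- suppressWarnings(as.numeric("ten"))

c(logical.and.integer, integer.and.double, double.and.text)
not.a.number
```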
Now that we know how to do assignment using <- and how to understand basic data types in R we are finally ready to read in the baby names data.
R knows the directory it was started in, and refers to this as the “working directory”. Since our workshop examples are in the Rintro folder, we should all take a moment to set that as our working directory.
getwd() # what is my current working directory?
# setwd("~/Desktop/Rintro") # change directory
Note that “~” means “my home directory” but that this can mean different things on different operating systems. You can also use the Files tab in RStudio to navigate to a directory, then click “More -> Set as working directory”.
We can also set the working directory using paths relative to the current working directory. Once we are in the “Rintro” folder we can navigate to the “dataSets” folder like this:
getwd() # get the current working directory
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro"
setwd("dataSets") # set wd to the dataSets folder
getwd()
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro/dataSets"
setwd("..") # set wd to enclosing folder ("up")
getwd()
## [1] "/home/izahn/Documents/Work/Classes/IQSS_Stats_Workshops/R/Rintro"
It can be convenient to list files in a directory without leaving R.
list.files("dataSets") # list files in the dataSets folder
## [1] "babyNames.csv"
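Relative paths like these can be explored safely in a scratch directory. The sketch below uses R's temporary directory rather than the workshop folder, and the folder and file names in it are invented for the example.

```r
## Work in a throwaway directory so nothing on your machine is changed.
scratch <- file.path(tempdir(), "RintroScratch")   # hypothetical folder name
dir.create(file.path(scratch, "dataSets"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(scratch, "dataSets", "example.csv")))

old.wd <- setwd(scratch)               # setwd() returns the previous directory
files.found <- list.files("dataSets")  # path is relative to the new working directory
setwd(old.wd)                          # go back to where we started

files.found
```

Storing the value returned by setwd makes it easy to return to the original directory afterwards.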
In order to read data from a file, you have to know what kind of file it is. The table below lists the functions that can import data from common file formats.
| data type | function | package |
|---|---|---|
| comma separated (.csv) | read_csv() | readr (tidyverse) |
| other delimited formats | read_delim() | readr (tidyverse) |
| R (.Rds) | read_rds() | readr (tidyverse) |
| Stata (.dta) | read_stata() | haven (tidyverse, needs to be attached separately) |
| SPSS (.sav) | read_spss() | haven (tidyverse, needs to be attached separately) |
| SAS (.sas7bdat) | read_sas() | haven (tidyverse, needs to be attached separately) |
| Excel (.xls, .xlsx) | read_excel() | readxl (tidyverse, needs to be attached separately) |
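One way to read the table is as a lookup from file extension to import function. The helper below, reader.for, is hypothetical (it is not part of readr, haven, or readxl) and only returns the function's name as a string, so it runs without any of those packages installed.

```r
## Hypothetical helper: map a file name to the import function from the table.
reader.for <- function(path) {
  extension <- tolower(tools::file_ext(path))
  switch(extension,
         csv      = "read_csv",
         rds      = "read_rds",
         dta      = "read_stata",
         sav      = "read_spss",
         sas7bdat = "read_sas",
         xls      = ,              # falls through to the xlsx entry
         xlsx     = "read_excel",
         "read_delim")             # default: other delimited formats
}

reader.for("dataSets/babyNames.csv")
reader.for("survey.XLSX")
```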
The purpose of this exercise is to practice reading data into R. The data in “dataSets/babyNames.csv” is moderately tricky to read, making it a good data set to practice on.
Open the help page for the read_csv function. How can you limit the number of rows to be read in?
Read just the first 10 rows of “dataSets/babyNames.csv”. Notice that the “Sex” column has been read as a logical (TRUE/FALSE).
Use the read_csv help page to figure out how to make it read the “Sex” column as a character. Make adjustments to your code until you have read in the first 10 rows with the correct column types. “Year” and “Name.length” should be integer (int), “Count” and “Percent” should be double (dbl) and everything else should be character (chr).
Once the column types are correct, read in the whole file and assign the result to baby.names.
## read ?read_csv
## limit rows with n_max argument
read_csv("dataSets/babyNames.csv", n_max = 10)
## Parsed with column specification:
## cols(
## Location = col_character(),
## Year = col_integer(),
## Sex = col_logical(),
## Name = col_character(),
## Count = col_double(),
## Percent = col_double(),
## Name.length = col_integer()
## )
## specify column types in the col_types argument
read_csv("dataSets/babyNames.csv", n_max = 10, col_types = "??c????")
## read all the data
baby.names <- read_csv("dataSets/babyNames.csv", col_types = "??c????")
It is always a good idea to examine the imported data set. Usually we want the result to be a data.frame.
## we know that this object will have type and length, because all R objects do.
typeof(baby.names)
## [1] "list"
length(baby.names) # number of columns
## [1] 7
## additional information about this data object
class(baby.names) # check to see that baby.names is a data.frame
## [1] "tbl_df" "tbl" "data.frame"
dim(baby.names) # how many rows and columns?
## [1] 1966001 7
names(baby.names) # or colnames(baby.names)
## [1] "Location" "Year" "Sex" "Name" "Count"
## [6] "Percent" "Name.length"
str(baby.names) # more details
## Classes 'tbl_df', 'tbl' and 'data.frame': 1966001 obs. of 7 variables:
## $ Location : chr "England and Wales" "England and Wales" "England and Wales" "England and Wales" ...
## $ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Name : chr "sophie" "chloe" "jessica" "emily" ...
## $ Count : num 7087 6824 6711 6415 6299 ...
## $ Percent : num 2.39 2.31 2.27 2.17 2.13 ...
## $ Name.length: int 6 5 7 5 6 6 9 7 3 5 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 7
## .. ..$ Location : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Sex : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Count : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Percent : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Name.length: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
glimpse(baby.names) # details, more compactly
## Observations: 1,966,001
## Variables: 7
## $ Location <chr> "England and Wales", "England and Wales", "England...
## $ Year <int> 1996, 1996, 1996, 1996, 1996, 1996, 1996, 1996, 19...
## $ Sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", ...
## $ Name <chr> "sophie", "chloe", "jessica", "emily", "lauren", "...
## $ Count <dbl> 7087, 6824, 6711, 6415, 6299, 5916, 5866, 5828, 52...
## $ Percent <dbl> 2.3942729, 2.3054210, 2.2672450, 2.1672444, 2.1280...
## $ Name.length <int> 6, 5, 7, 5, 6, 6, 9, 7, 3, 5, 7, 5, 7, 4, 4, 7, 5,...
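Why was Sex first read as a logical? The first rows of that column all contain F, which R's type guessing can take as shorthand for FALSE. The sketch below reproduces the effect on a tiny made-up data set using base R's read.csv, which guesses types in a similar way and takes a colClasses argument where read_csv takes col_types. This is an analogy, not readr's own code path, and readr's guessing rules may differ between versions.

```r
## Three made-up rows in which the Sex column contains only "F".
csv.text <- "Sex,Count\nF,10\nF,12\nF,9"

guessed <- read.csv(text = csv.text)                     # column types are guessed
fixed   <- read.csv(text = csv.text,
                    colClasses = c(Sex = "character"))   # Sex forced to character

typeof(guessed$Sex)
typeof(fixed$Sex)
```

Stating the column type explicitly, as in the col_types = "??c????" call above, removes the guesswork for that column.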
Usually data read into R will be stored as a data.frame.
A data.frame has two dimensions corresponding to the number of rows and the number of columns (in that order).
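The rows-then-columns order shows up both in dim() and in base R's square-bracket indexing. Here is a minimal sketch with a made-up data.frame:

```r
## A small made-up data.frame with 3 rows and 2 columns.
pets <- data.frame(name = c("Rex", "Tom", "Tweety"),
                   legs = c(4, 4, 2))

dim(pets)    # rows first, then columns
nrow(pets)
ncol(pets)

second.row  <- pets[2, ]         # the row index goes before the comma
legs.column <- pets[, "legs"]    # the column index goes after the comma
one.value   <- pets[3, "legs"]   # row 3, column "legs"
```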
You can extract subsets of data.frames using slice to select rows by number and filter to select rows that match some condition. It works like this:
## make up some example data
(example.df <- data.frame(id = rep(letters[1:4], each = 4),
                          t = rep(1:4, times = 4),
                          var1 = runif(16),
                          var2 = sample(letters[1:3], 16, replace = TRUE)))
## id t var1 var2
## 1 a 1 0.8843358 c
## 2 a 2 0.5557704 b
## 3 a 3 0.2560863 b
## 4 a 4 0.1324027 a
## 5 b 1 0.4816932 a
## 6 b 2 0.6519777 b
## 7 b 3 0.3590353 a
## 8 b 4 0.6234120 a
## 9 c 1 0.9010314 c
## 10 c 2 0.5467439 a
## 11 c 3 0.8115376 a
## 12 c 4 0.2341479 a
## 13 d 1 0.9571786 b
## 14 d 2 0.1186927 c
## 15 d 3 0.3832028 a
## 16 d 4 0.7507328 a
## rows 2 and 4
slice(example.df, c(2, 4))
## id t var1 var2
## 1 a 2 0.5557704 b
## 2 a 4 0.1324027 a
## rows where id == "a"
filter(example.df, id == "a")
## id t var1 var2
## 1 a 1 0.8843358 c
## 2 a 2 0.5557704 b
## 3 a 3 0.2560863 b
## 4 a 4 0.1324027 a
## rows where id is either "a" or "b"
filter(example.df, id %in% c("a", "b"))
## id t var1 var2
## 1 a 1 0.8843358 c
## 2 a 2 0.5557704 b
## 3 a 3 0.2560863 b
## 4 a 4 0.1324027 a
## 5 b 1 0.4816932 a
## 6 b 2 0.6519777 b
## 7 b 3 0.3590353 a
## 8 b 4 0.6234120 a
slice and filter are used to extract rows. select is used to extract columns.
select(example.df, id, var1)
## id var1
## 1 a 0.8843358
## 2 a 0.5557704
## 3 a 0.2560863
## 4 a 0.1324027
## 5 b 0.4816932
## 6 b 0.6519777
## 7 b 0.3590353
## 8 b 0.6234120
## 9 c 0.9010314
## 10 c 0.5467439
## 11 c 0.8115376
## 12 c 0.2341479
## 13 d 0.9571786
## 14 d 0.1186927
## 15 d 0.3832028
## 16 d 0.7507328
select(example.df, id, t, var1)
## id t var1
## 1 a 1 0.8843358
## 2 a 2 0.5557704
## 3 a 3 0.2560863
## 4 a 4 0.1324027
## 5 b 1 0.4816932
## 6 b 2 0.6519777
## 7 b 3 0.3590353
## 8 b 4 0.6234120
## 9 c 1 0.9010314
## 10 c 2 0.5467439
## 11 c 3 0.8115376
## 12 c 4 0.2341479
## 13 d 1 0.9571786
## 14 d 2 0.1186927
## 15 d 3 0.3832028
## 16 d 4 0.7507328
You can also conveniently select a single column using $, like this:
example.df$t
## [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
Data manipulation commands can be combined:
filter(select(example.df,
              id,
              var1),
       id == "a")
## id var1
## 1 a 0.8843358
## 2 a 0.5557704
## 3 a 0.2560863
## 4 a 0.1324027
In the previous example we used == to filter rows where id was “a”. Other relational and logical operators are listed below.
| Operator | Meaning |
|---|---|
| == | equal to |
| != | not equal to |
| > | greater than |
| >= | greater than or equal to |
| < | less than |
| <= | less than or equal to |
| %in% | contained in |
| & | and |
| \| | or |
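These operators work on whole vectors at once and return one TRUE or FALSE per element, which is what filter relies on. A short sketch on a made-up vector of ages, using base R only:

```r
## Made-up ages to test conditions on.
ages <- c(3, 17, 25, 40, 65)

working.age  <- ages >= 18 & ages < 65   # both conditions must hold
young.or.old <- ages < 18 | ages >= 65   # either condition may hold
listed       <- ages %in% c(17, 40)      # is each age one of these values?

ages[working.age]    # a logical vector can be used to pick elements
ages[young.or.old]
ages[listed]
```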
You can modify data.frames using the mutate() function. It works like this:
example.df
## id t var1 var2
## 1 a 1 0.8843358 c
## 2 a 2 0.5557704 b
## 3 a 3 0.2560863 b
## 4 a 4 0.1324027 a
## 5 b 1 0.4816932 a
## 6 b 2 0.6519777 b
## 7 b 3 0.3590353 a
## 8 b 4 0.6234120 a
## 9 c 1 0.9010314 c
## 10 c 2 0.5467439 a
## 11 c 3 0.8115376 a
## 12 c 4 0.2341479 a
## 13 d 1 0.9571786 b
## 14 d 2 0.1186927 c
## 15 d 3 0.3832028 a
## 16 d 4 0.7507328 a
## modify example.df and assign the modified data.frame the name example.df
example.df <- mutate(example.df,
                     var2 = var1/t, # replace the values in var2
                     var3 = 1:length(t), # create a new column named var3
                     var4 = factor(letters[t]),
                     t = NULL # delete the column named t
                     )
## examine our changes
example.df
## id var1 var2 var3 var4
## 1 a 0.8843358 0.88433578 1 a
## 2 a 0.5557704 0.27788521 2 b
## 3 a 0.2560863 0.08536211 3 c
## 4 a 0.1324027 0.03310067 4 d
## 5 b 0.4816932 0.48169321 5 a
## 6 b 0.6519777 0.32598887 6 b
## 7 b 0.3590353 0.11967844 7 c
## 8 b 0.6234120 0.15585300 8 d
## 9 c 0.9010314 0.90103143 9 a
## 10 c 0.5467439 0.27337196 10 b
## 11 c 0.8115376 0.27051253 11 c
## 12 c 0.2341479 0.05853698 12 d
## 13 d 0.9571786 0.95717862 13 a
## 14 d 0.1186927 0.05934635 14 b
## 15 d 0.3832028 0.12773428 15 c
## 16 d 0.7507328 0.18768320 16 d
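The var3 column above was built with 1:length(t). That works here, but it is a known trap when a vector might be empty, because 1:0 counts down instead of producing nothing. Here is a short sketch of the difference, using base R's seq_along as a swapped-in alternative (it is not what the code above uses):

```r
values <- c(10, 20, 30)
empty  <- numeric(0)

1:length(values)     # 1 2 3, as intended
seq_along(values)    # same result

bad.index  <- 1:length(empty)   # 1:0 counts down, giving two elements: 1 and 0
good.index <- seq_along(empty)  # zero-length result, which is usually what we want
```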
Now that we have made some changes to our data set, we might want to save those changes to a file.
# write data to a .csv file
write_csv(example.df, "example.csv")
# write data to an R file
write_rds(example.df, "example.rds")
# write data to a Stata file
library(haven)
write_dta(example.df, "example.dta")
In addition to importing individual datasets, R can save and load entire workspaces.
ls() # list objects in our workspace
## [1] "baby.names" "comet" "example.df" "fit" "nwbuilding"
## [6] "worldpop" "x" "y" "z"
save.image(file="myWorkspace.RData") # save workspace
rm(list=ls()) # remove all objects from our workspace
ls() # list stored objects to make sure they are deleted
## character(0)
Load the “myWorkspace.RData” file and check that it is restored
load("myWorkspace.RData") # load myWorkspace.RData
ls() # list objects
## [1] "baby.names" "comet" "example.df" "fit" "nwbuilding"
## [6] "worldpop" "x" "y" "z"
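Saving the whole workspace is often more than is needed. To keep a single object, it can be written to and read back from an .rds file. The sketch below uses base R's saveRDS and readRDS (which readr's write_rds and read_rds wrap) and a temporary file, so the file name is not one from the workshop.

```r
## A small made-up object to save.
scores <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))

rds.file <- tempfile(fileext = ".rds")  # throwaway location
saveRDS(scores, rds.file)               # write one object to disk
restored <- readRDS(rds.file)           # read it back under a new name

identical(scores, restored)
```

Unlike load, readRDS returns the object instead of restoring it under its old name, so you choose what to call it.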
Read in the “babyNames.csv” file if you have not already done so, assigning the result to baby.names.
Filter baby.names to show only names given to at least 5 percent of boys.
Filter baby.names to include only names given to at least 3 percent of girls. Save this to a Stata data set named “popularGirlNames.dta”.
filter(baby.names, Sex == "M" & Percent >= 5)
## # A tibble: 0 × 7
## # ... with 7 variables: Location <chr>, Year <int>, Sex <chr>, Name <chr>,
## # Count <dbl>, Percent <dbl>, Name.length <int>
baby.names <- mutate(baby.names, Proportion = Percent/100)
popular.girl.names <- filter(baby.names, Sex == "F" & Percent >= 3)
library(haven) # write_dta comes from the haven package
## Stata variable names cannot contain dots, so rename Name.length first
popular.girl.names <- rename(popular.girl.names, Name_length = Name.length)
write_dta(popular.girl.names, "popularGirlNames.dta")
Descriptive statistics of single variables are straightforward:
sum(example.df$var1) # calculate sum of var 1
## [1] 8.647981
mean(example.df$var1)
## [1] 0.5404988
median(example.df$var1)
## [1] 0.5512572
sd(example.df$var1) # calculate standard deviation of var1
## [1] 0.2756081
var(example.df$var1)
## [1] 0.07595985
## summaries of individual columns
summary(example.df$var1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1187 0.3333 0.5513 0.5405 0.7659 0.9572
summary(example.df$var2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0331 0.1111 0.2291 0.3250 0.3649 0.9572
## summary of whole data.frame
summary(example.df)
## id var1 var2 var3 var4
## a:4 Min. :0.1187 Min. :0.0331 Min. : 1.00 a:4
## b:4 1st Qu.:0.3333 1st Qu.:0.1111 1st Qu.: 4.75 b:4
## c:4 Median :0.5513 Median :0.2291 Median : 8.50 c:4
## d:4 Mean :0.5405 Mean :0.3250 Mean : 8.50 d:4
## 3rd Qu.:0.7659 3rd Qu.:0.3649 3rd Qu.:12.25
## Max. :0.9572 Max. :0.9572 Max. :16.00
Some of these functions (e.g., summary) will also work with data.frames and other types of objects; others (such as sd) will not.
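For a function like sd that expects a single vector, one option is to apply it to each column in turn. Here is a minimal sketch using base R's sapply on a made-up data.frame:

```r
## Made-up measurements with two numeric columns.
measurements <- data.frame(height = c(1, 2, 3, 4),
                           weight = c(2, 4, 6, 8))

## sapply() calls sd() once per column and collects the results
## into a named vector.
sd.by.column <- sapply(measurements, sd)
sd.by.column
```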
The summarize function can be used to calculate statistics by grouping variable. Here is how it works.
summarize(group_by(example.df, id), mean(var1), sd(var1))
## # A tibble: 4 × 3
## id `mean(var1)` `sd(var1)`
## <fctr> <dbl> <dbl>
## 1 a 0.4571488 0.3357088
## 2 b 0.5290296 0.1356012
## 3 c 0.6233652 0.2999268
## 4 d 0.5524517 0.3741262
You can group by multiple variables:
summarize(group_by(example.df, id, var3), mean(var1), sd(var1))
## Source: local data frame [16 x 4]
## Groups: id [?]
##
## id var3 `mean(var1)` `sd(var1)`
## <fctr> <int> <dbl> <dbl>
## 1 a 1 0.8843358 NA
## 2 a 2 0.5557704 NA
## 3 a 3 0.2560863 NA
## 4 a 4 0.1324027 NA
## 5 b 5 0.4816932 NA
## 6 b 6 0.6519777 NA
## 7 b 7 0.3590353 NA
## 8 b 8 0.6234120 NA
## 9 c 9 0.9010314 NA
## 10 c 10 0.5467439 NA
## 11 c 11 0.8115376 NA
## 12 c 12 0.2341479 NA
## 13 d 13 0.9571786 NA
## 14 d 14 0.1186927 NA
## 15 d 15 0.3832028 NA
## 16 d 16 0.7507328 NA
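The NA values in the sd(var1) column are expected: every id/var3 combination contains a single row, and the standard deviation of one value is undefined. The sketch below shows this with made-up numbers, using base R's tapply as a stand-in for group_by plus summarize.

```r
## Made-up data: group "a" has two rows, group "b" has only one.
scores <- data.frame(group = c("a", "a", "b"),
                     value = c(1, 3, 10))

single.sd <- sd(5)   # one value has no spread to estimate, so the result is NA

group.means <- tapply(scores$value, scores$group, mean)
group.sds   <- tapply(scores$value, scores$group, sd)

group.means
group.sds
```

Group b gets a mean but an NA standard deviation, just like each single-row group in the output above.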
Earlier we learned how to write a data set to a file. But what if we want to write something that isn’t in a nice rectangular format, like the output of summary? For that we can use the sink() function:
sink(file="output.txt", split=TRUE) # start logging
print("This is the summary of example.df \n")
## [1] "This is the summary of example.df \n"
print(summary(example.df))
## id var1 var2 var3 var4
## a:4 Min. :0.1187 Min. :0.0331 Min. : 1.00 a:4
## b:4 1st Qu.:0.3333 1st Qu.:0.1111 1st Qu.: 4.75 b:4
## c:4 Median :0.5513 Median :0.2291 Median : 8.50 c:4
## d:4 Mean :0.5405 Mean :0.3250 Mean : 8.50 d:4
## 3rd Qu.:0.7659 3rd Qu.:0.3649 3rd Qu.:12.25
## Max. :0.9572 Max. :0.9572 Max. :16.00
sink() ## sink with no arguments turns logging off
births.by.year.
name.length.by.location.
sum(baby.names$Count)
## [1] 76865321
sum(filter(baby.names, Location == "MA")$Count)
## [1] 1232841
births.by.year <- summarize(group_by(baby.names, Year), sum(Count))
mean(baby.names$Name.length)
## [1] 5.978752
name.length.by.location <- summarize(group_by(baby.names, Location), mean(Name.length))
Thanks to classes and methods, you can plot() many kinds of objects:
plot(example.df$var4)
plot(select(example.df, id, var1))
plot(select(example.df, id, var4))
plot(select(example.df, var1, var2))